In [2]:
# importing useful python packages
import sklearn
import numpy as np
import pandas as pd
import numpy.random as random
In supervised learning, the model describes how one set of observations, called inputs, affects another set of observations, called outputs. We are given a dataset with predictor variables and response variables, and use it to learn a function that maps the predictors to the response. We assume that there is some true mapping function $f$, and try to come up with an approximation $\hat f$ so that, given new data, we can make accurate predictions.
Unsupervised learning is where you only have input data and no corresponding output variables. The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it. It is called unsupervised learning because, unlike supervised learning above, there are no correct answers and there is no teacher. Algorithms are left on their own to discover interesting structure and patterns in the data.
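As a quick, hedged illustration of this idea (the data and the choice of k-means clustering are arbitrary here, not part of this lab), the cell below groups some made-up, unlabeled points and lets the algorithm discover the two blobs on its own.
In [ ]:
# Sketch of unsupervised learning: k-means clustering on made-up 2D points.
# There are no labels here; the algorithm groups the points by itself.
from sklearn.cluster import KMeans

points = random.normal(0, 1, size=(30, 2))   # toy, unlabeled data
points[15:] += 5                             # shift half the points to form a second blob
clusters = KMeans(n_clusters=2).fit_predict(points)
print(clusters)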
In a typical supervised learning problem, we have some number of predictor variables and a response (or outcome) variable. The outcome variable is the variable that we would like to predict. The predictor variables are the ones that we have at our disposal to try to determine the outcome variable. We usually write $p$ for the number of predictor variables, and we call the $i$th instance of the $j$th predictor variable $X_{i,j}$. We call the $i$th instance of the response variable $Y_i$. Generally we write $Y = f(X) + \epsilon$, where $\epsilon$ represents the irreducible error in the relationship between $X$ and $Y$, which cannot be predicted from $X$.
Summary of Notation:
$\hat{Y_i}$ is the predicted outcome from our model.
$Y_i$ is the true outcome.
$X_{i,j}$ is the $i$th observation of the $j$th predictor variable.
We sometimes refer to the predictor variables for observation $i$ collectively as a vector: $X_i$.
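To make the notation concrete, the cell below simulates data from the model $Y = f(X) + \epsilon$ using a made-up linear $f$ and Gaussian noise; both the function and the noise level are arbitrary choices for this sketch.
In [ ]:
# Sketch: simulating Y = f(X) + epsilon with a made-up "true" function f.
n = 100
X = random.uniform(0, 10, size=n)        # a single predictor variable

def f(x):
    # the (assumed) true function; in practice f is unknown
    return 2.0 * x + 1.0

epsilon = random.normal(0, 1.0, size=n)  # irreducible error
Y = f(X) + epsilon                       # observed responses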
There are two types of supervised learning problems: regression and classification. In regression, the response variable is continuous and ordered. In classification, the response variable is discrete and unordered.
For example, in a regression problem we might try to predict the price of a house from predictor variables such as size and geographic location.
In a classification problem we might try to predict whether or not someone has a disease based on various health metrics.
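As a rough sketch of the difference (the tiny toy data and the two models below are illustrative choices, not part of this lab), a regression model predicts a continuous value while a classification model predicts a discrete label.
In [ ]:
# Sketch: a regression model vs. a classification model on tiny made-up data.
from sklearn.linear_model import LinearRegression, LogisticRegression

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])

y_price = np.array([1.1, 1.9, 3.2, 3.9])    # continuous response (e.g. a price)
reg = LinearRegression().fit(X_toy, y_price)

y_disease = np.array([0, 0, 1, 1])          # discrete response (e.g. disease / no disease)
clf = LogisticRegression().fit(X_toy, y_disease)

print(reg.predict([[2.5]]), clf.predict([[2.5]]))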
Once we have a model, we need some metric of performance to evaluate how good it is. We call this metric the loss function. Below are two typical loss functions used for regression problems.
Residual Sum of Squares: $RSS = \sum_{i=1}^n{(y_i - \hat{y_i})^2}$
Mean Squared Error: $MSE = \frac{1}{n} \sum_{i=1}^n{(y_i - \hat{y_i})^2}$
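As a sketch, both loss functions are one-liners in numpy; the `y_true` and `y_hat` arrays below are made-up values purely for illustration.
In [ ]:
# Computing RSS and MSE for some hypothetical predictions (values are made up).
y_true = np.array([3.0, 1.5, 4.0, 2.2])
y_hat = np.array([2.8, 1.7, 3.5, 2.0])

rss = np.sum((y_true - y_hat) ** 2)       # Residual Sum of Squares
mse = np.mean((y_true - y_hat) ** 2)      # Mean Squared Error (RSS / n)
print(rss, mse)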
Explain why the Residual Sum of Squares makes sense as a way to evaluate performance for a regression problem.
One way to validate our model is to randomly split the given dataset into two subsets: a training set and a test set. This guards against overfitting, where the model captures fluctuations due to the irreducible error ($\epsilon$) in the training data rather than solely the function $f$ that relates $X$ (predictors) to $Y$ (response). Because of this, it's important to validate the model by testing it on data it was not trained on.
What are the disadvantages of this approach?
Another way to select a model is to use metrics (loss functions) that penalize flexibility in the model. We'll talk more about this later.
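As a side note (a minimal sketch, not part of this lab's workflow), scikit-learn provides a helper that performs this kind of random split in one call; the next cell then carries out the 50/50 split by hand.
In [ ]:
# Sketch: a random 50/50 split using scikit-learn's train_test_split,
# applied to a toy array so the cell is self-contained.
from sklearn.model_selection import train_test_split

toy = np.arange(20).reshape(10, 2)                  # made-up stand-in data
toy_train, toy_test = train_test_split(toy, test_size=0.5)
print(toy_train.shape, toy_test.shape)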
In [12]:
# load the data and convert the DataFrame to a numpy array
dataset = pd.read_csv('dataset_3.txt')
dataset = dataset.values

# shuffle the rows, then split them 50/50 into train and test sets
shuffled = dataset[random.permutation(len(dataset))]
midpoint = len(shuffled) // 2   # integer division so the result can be used as an index
train = shuffled[:midpoint]
test = shuffled[midpoint:]
In [13]:
train
Out[13]:
In [14]:
test
Out[14]: